Augmenting Layer-Based Object Detection With Deep Convolutional Neural Networks

ABSTRACT

By way of example, the technology disclosed by this document receives image data; extracts a depth image and a color image from the image data; creates a mask image by segmenting the depth image; determines a first likelihood score from the depth image and the mask image using a layered classifier; determines a second likelihood score from the color image and the mask image using a deep convolutional neural network; and determines a class of at least a portion of the image data based on the first likelihood score and the second likelihood score. Further, the technology can pre-filter the mask image using the layered classifier and then use the pre-filtered mask image and the color image to calculate a second likelihood score using the deep convolutional neural network to speed up processing.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation-in-part of U.S. patent application Ser. No. 14/171,756, titled “Efficient Layer-Based Object Recognition”, filed Feb. 3, 2014, the entire contents of which are incorporated herein by reference.

BACKGROUND

The present disclosure relates to object recognition using layer-based object detection with deep convolutional neural networks.

Today many computer systems and machines rely on person recognition techniques for various different applications. In some example applications, machines and computer systems need to know if there is a human present (or which human is present) at a particular location in order to turn on/off or activate a particular program. Person detection in particular is often a fundamental skill in human-robot interaction. In general, a robot needs to know where the person is in order to interact with them.

While some progress has been made at detecting people in public places (e.g., see P. F. Felzenszwalb, R. B. Girshick, D. McAllester and D. Ramanan, “Object Detection with Discriminatively Trained Part Based Models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627-1645, 2010; and T. Linder and K. O. Arras, “People Detection, Tracking and Visualization using ROS on a Mobile Service Robot,” in Robot Operating System (ROS): The Complete Reference, Springer, 2016), in other domains, such as a home environment, the challenges are particularly difficult.

One solution that has been developed to improve object recognition is to use a layer-based classification/object detection system to differentiate between classes of objects. The layer-based classification uses a segmented depth image to differentiate between two or more classes of people. However, one common error is present in layer-based classification using depth images, especially in a moving object detection system (such as a robot): when the system approaches a square object at an off angle (e.g., 45 degrees), that object will appear curved in the depth image, making it difficult to distinguish from people. In moving object detection systems, a false positive classification that occurs when a robot approaches an object at an unusual angle can result in the robot becoming stuck.

Another solution that has been developed to improve object recognition is to use a deep convolutional neural network to classify objects and/or images. The use of deep convolutional neural networks to classify objects and/or images is a relatively recent phenomenon. Although the algorithm itself is many decades old, there has been significant recent work in optimizing these algorithms for large data sets and improving their speed and precision. Most notably, work published by Krizhevsky, Sutskever and Hinton at the University of Toronto detailed a specific network architecture, referred to as “AlexNet,” that performed well on the large object recognition challenge, ImageNet. See A. Krizhevsky, I. Sutskever and G. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Neural Information Processing (NIPS), Lake Tahoe, Nev., USA, 2012.

The deep convolutional neural network often utilizes RGB images to classify objects and/or images. While recent improvements to the deep convolutional neural network have shown success at large object image recognition as well as increasing the size of the training set and tolerance of noise, the deep convolutional neural network suffers from a significant weakness. The deep convolutional neural network is overly reliant on a single sensing modality (e.g., RGB image data). Not only is segmenting in RGB much more difficult and computationally expensive, but the classifier itself emphasizes learning a decision boundary based on edges and textures, features that may not be the only, or even the best, choice depending on the sensing modality and the object being recognized.

AlexNet, however, does not solve the segmentation problem—when restricted to color images, other algorithms like graph cuts were used to extract object-bounding boxes, which were then classified. See R. Girshick, J. Donahue, T. Darrell and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Computer Vision and Pattern Recognition, Columbus, Ohio, 2014. Alternatively, there has been a notable effort to move beyond the single modality limitations by incorporating depth. See C. Couprie, C. Farabet, L. Najman and Y. LeCun, “Convolutional Nets and Watershed Cuts for Real-Time Semantic Labeling of RGBD Videos,” Journal of Machine Learning Research, vol. 15 (October), pp. 3489-3511, 2014.

Couprie et al. exponentially reduce the number of bounding boxes to evaluate by applying watershed cuts to depth images for image segmentation prior to RGB classification. Gupta et al. go a step further by including the depth data in the segmentation and classification step. See S. Gupta, R. Girshick, P. Arbeláez and J. Malik, “Learning Rich Features from RGB-D Images for Object Detection and Segmentation,” in European Conference on Computer Vision, Zurich, Switzerland, 2014. Their work, however, requires knowledge of the camera orientation in order to estimate both height above ground and angle with gravity for every pixel in the image for use with AlexNet.

Within the domain of person detection there is also multimodal fusion work focused on improving specific classifiers using a combination of RGB and depth information. People can be detected in depth data alone, as demonstrated by previous work in layered person detection and contour estimation, and they can be detected in monocular camera data, either color or grayscale images. See E. Martinson, “Detecting Occluded People for Robotic Guidance,” in Robots and Human Interactive Communication (RO-MAN), Edinburgh, UK, 2014; L. Spinello, K. Arras, R. Triebel and R. Siewart, “A Layered Approach to People Detection in 3D Range Data,” in Proc. of the AAAI Conf. on Artificial Intelligence: Physically Grounded AI Track, Atlanta, Ga., 2010; and N. Kirchner, A. Alempijevic, A. Virgona, X. Dai, P. Ploger and R. Venkat, “A Robust People Detection, Tracking and Counting System,” in Australian Conf. on Robotics and Automation, Melbourne, Australia, 2014.

The advantage of using the two modalities is that the failure points for depth-based recognition are not the same as the failure points for color-based recognition. Given a registered color and depth image, a number of systems have been developed to take advantage of the fusion of these two modalities.

The method described by Spinello and Arras (Univ. of Freiburg) fuses these two modalities by applying similar classifiers in each domain. See L. Spinello and K. Arras, “People Detection in RGB-D Data,” in Int. Conf. on Intelligent Robots and Systems (IROS), San Francisco, USA, 2011. The depth image is used to first identify regions of interest based on groups of neighboring pixels. Then the histogram of oriented gradients, originally developed for object recognition in RGB images and widely used in color-based person detection, is calculated for regions of interest in the color image. A second, related algorithm, the histogram of oriented depths, is then applied to the depth image objects, and the resulting combined vector is classified using a support vector machine. More recent work from Freiburg (see above) integrates other publicly available detectors, including one included with the point cloud library. See M. Munaro, F. Basso, E. Menegatti, “Tracking people within groups with RGB-D data,” in International Conference on Intelligent Robots and Systems (IROS) 2012, Villamoura, Portugal, 2012.

Another related RGB-D classification system was published by the University of Michigan, whereby additional modalities such as motion cues, skin color, and detected faces are also added to a combined classification system. See W. Choi, C. Pantofaru, S. Savarese, “Detecting and Tracking People using an RGB-D Camera via Multiple Detector Fusion,” in Workshop on Challenges and Opportunities in Robot Perception (in conjunction with ICCV-11), 2011. Although both of these methods make use of classifiers from both the RGB and depth domain, neither one takes advantage of the precision increase a convolutional neural network can enable. Where the first method uses two very similar classifiers (HOG vs. HOD) to handle the cross-domain fusion, the system is learning the same decision boundary and will fail when that decision boundary is difficult to identify. The second method, by contrast, employs a variety of different detectors across different domains. However, the majority (e.g., motion cues, skin color, and face detection) are very weak classifiers in the general detection problem, as opposed to convolutional neural networks.

Therefore, a solution is needed that reduces errors in classifying depth images without the increased processing time and computational difficulty of the convolutional neural network.

SUMMARY

According to one innovative aspect of the subject matter described in this disclosure, a system includes one or more processors and one or more memories storing instructions that, when executed by the one or more processors, cause the system to: receive image data; extract a depth image and a color image from the image data; create a mask image by segmenting the depth image; determine a first likelihood score from the depth image and the mask image using a layered classifier; determine a second likelihood score from the color image and the mask image using a deep convolutional neural network; and determine a class of at least a portion of the image data based on the first likelihood score and the second likelihood score.

In general, another innovative aspect of the subject matter described in this disclosure may be embodied in methods that include receiving image data; creating a mask image by segmenting the image data into a plurality of components; determining a first likelihood score from the image data and the mask image using a layered classifier; determining a second likelihood score from the image data and the mask image using a deep convolutional neural network (CNN); and determining a class for at least a portion of the image data based on the first likelihood score and the second likelihood score.

Other aspects include corresponding methods, systems, apparatus, and computer program products for these and other innovative aspects. These and other implementations may each optionally include one or more of the following features and/or operations. For instance, the features and/or operations include: extracting a first image from the image data; generating an object image by copying pixels from the first image of the components in the mask image; classifying the object image using the deep CNN; generating classification likelihood scores indicating probabilities of the object image belonging to different classes of the deep CNN; generating the second likelihood score based on the classification likelihood scores; that the first image is one of a color image, a depth image, and a combination of a color image and a depth image; fusing the first likelihood score and the second likelihood score together into an overall likelihood score and, responsive to satisfying a predetermined threshold with the overall likelihood score, classifying the at least the portion of the image data as representing a person using the overall likelihood score; extracting a depth image and a color image from the image data; that determining the first likelihood score from the image data and the mask image using the layered classifier includes determining the first likelihood score from the depth image and the mask image using the layered classifier, and determining the second likelihood score from the image data and the mask image using the deep CNN includes determining the second likelihood score from the color image and the mask image using the deep CNN; that the deep CNN has a soft max layer as a final layer to generate the second likelihood score that the at least the portion of the image data represents a person; converting the first likelihood score and the second likelihood score into a first log likelihood value and a second log likelihood value; and calculating a combined likelihood score by using a weighted summation of the first log likelihood value and the second log likelihood value; that the class is a person.

The novel detection technology presented in this disclosure is particularly advantageous in a number of respects. For example, the technology described herein can increase precision of the different sensors across different environments without sacrificing recall. Even within domains in which individual classifiers have been demonstrated to be particularly strong, the fusion of the layered classifier and convolutional neural network improves performance. Further, the technology described herein can increase precision by using different types of decision boundaries learned by the layered classifier and convolutional neural network. Where the layered system focuses on the geometry of the layers of pixels in image data, the neural network emphasizes boundaries and contours. Further, the disclosed technology can achieve increased precision and identification in a wider variety of environments without rendering false positives as with the solutions discussed in the Background.

The disclosure is illustrated by way of example, and not by way of limitation in the figures of the accompanying drawings in which like reference numerals are used to refer to similar elements.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an example system for recognizing image objects.

FIG. 2A is a block diagram of an example computing device.

FIG. 2B is a block diagram of an example detection module.

FIG. 3 is a flowchart of an example method for recognizing image objects.

FIGS. 4A and 4B are flowcharts of a further example method for recognizing image objects.

FIG. 5 is a diagram of an example method for detecting people blobs, slicing the people blobs into layers, and comparing the layers to existing user models to recognize the people associated with the people blobs.

FIG. 6 depicts example blobs extracted from an example depth image.

FIG. 7 is a diagram showing an example segmentation of a person blob into a plurality of layers.

FIG. 8 is a table describing various non-limiting advantages of the novel layer-based detection technology disclosed herein.

FIG. 9 depicts example layers extracted from an upright person blob processed from a depth image.

FIGS. 10A-10B are graphs showing an example comparison between two different types of sensors.

FIG. 10C is a graph showing a blob-level comparison between the novel technology described herein and another alternative.

FIG. 11 illustrates an example application of the detection technology.

FIGS. 12A-12C depict representations of example images captured by asensor.

FIG. 13 is a flowchart of a further example method for recognizing image objects.

FIGS. 14A-14B are flowcharts of a further example method for recognizing image objects.

FIG. 15 is a diagram of an example method for segmenting blobs, classifying the blobs, and generating a score for the blobs.

FIG. 16 is a block diagram of an example image classification system.

FIGS. 17A-17C are graphs and data showing evaluations using differentcombinations of classifiers.

FIG. 18 is a block diagram of an example image classification system using a pre-filter layered classifier.

FIG. 19 is a graph showing evaluations using the pre-filter layered classifier.

DESCRIPTION

The detection technology described herein can efficiently and effectively detect and recognize unique objects such as people and non-people from image data, such as depth images, color images, etc. In an example embodiment, the technology fuses information relating to both the depth images and color images into a single likelihood score using a layered classifier and a convolutional neural network (CNN) classifier. The technology can advantageously recognize objects (e.g., people) in depth images even when important aspects of those objects (e.g., a person's head or shoulders) are occluded. In essence, an occlusion means that a part of an object (e.g., a portion of a person's body) in a scene being recorded is blocked from view (e.g., from the camera's perspective). Occlusions may be caused by a number of variables. For instance, occlusions may be caused by, but are not limited to: 1) physical objects in frame that block part of the person from view; 2) the edge of the camera image plane; and 3) other artifacts that may block or obscure objects in images from being visible, such as lighting, focus, noise, etc., during capture.

In a non-limiting example, the detection technology includes computer-implemented algorithms that recognize objects in depth images by comparing object segments (also called layers) to reference surfaces (e.g., a convex parabola and a line). The detection technology may require only a minimal set of parameters to be estimated for each 2D segment of the object, from which a classifier can be rapidly trained to separate people from objects. Combining the 2D scans can then result in >90% recognition precision.

The detection technology is applicable in numerous areas, including exploring people detection in real environments, extending recognition to people carrying objects, and working with non-upright poses including people lying down or bent over. In general, the goal is to make detection work in every environment in which a person might want to interact with an intelligent computing device.

FIG. 1 is a block diagram of an example system 100 for recognizing objects. As illustrated, the system 100 may include a computation server 101 and/or a detection system 103 that may be accessed and/or interacted with by a user 125 (as depicted by signal line 118). Depending on the implementation, the system may or may not include a computation server 101. In embodiments where a computation server 101 is included, the detection system 103 and the computation server 101 may be communicatively coupled via a network 105 via signal lines 106 and 108, respectively. For example, the detection system 103 and the computation server 101 may be communicatively coupled to each other via the network 105 to exchange data, such as sensor data, recognition data, etc. The signal lines 106 and 108 in FIG. 1 may be representative of one or more wired and/or wireless connections. As a further example, the detection system 103 may transmit sensor data to the computation server 101 for processing and the computation server 101 may process the data as described herein to detect and recognize objects and send data and/or results describing the recognized objects to the detection system 103 for use thereby during operation. In embodiments where a computation server 101 is not included, the detection system 103 may operate autonomously or in conjunction with other detection systems 103 (not visible) to detect and recognize objects. For instance, a detection system 103 may be networked via a computer network with other similar detection systems 103 to perform the computations discussed herein.

While FIG. 1 depicts a single detection system 103 and computation server 101, it should be understood that a variety of different system environments and configurations are possible, contemplated, and within the scope of the present disclosure. For instance, some embodiments may include additional or fewer computing devices, services, and/or networks, and may implement various functionality locally or remotely on other computing devices. Further, various entities may be integrated into a single computing device or system or distributed across additional computing devices or systems, etc. For example, the detection module 135 may be stored in, executable by, and distributed across a combination of computing devices and/or systems or in one computing device and/or system.

The network 105 may include a conventional type network, wired or wireless, and may have any number of configurations, such as a star configuration, token ring configuration, or other known configurations. The network 105 may include one or more local area networks (“LANs”), wide area networks (“WANs”) (e.g., the Internet), virtual private networks (“VPNs”), peer-to-peer networks, near-field networks (e.g., Bluetooth™), cellular networks (e.g., 3G, 4G, other generations), and/or any other interconnected data path across which multiple computing nodes may communicate. Data may be transmitted in encrypted or unencrypted form between the nodes of the network 105 using a variety of different communication protocols including, for example, various Internet layer, transport layer, or application layer protocols. For example, data may be transmitted via the networks using transmission control protocol/Internet protocol (TCP/IP), user datagram protocol (UDP), transmission control protocol (TCP), hypertext transfer protocol (HTTP), secure hypertext transfer protocol (HTTPS), dynamic adaptive streaming over HTTP (DASH), real-time streaming protocol (RTSP), real-time transport protocol (RTP) and the real-time transport control protocol (RTCP), voice over Internet protocol (VoIP), file transfer protocol (FTP), WebSocket (WS), wireless access protocol (WAP), various messaging protocols (SMS, MMS, XMS, IMAP, SMTP, POP, WebDAV, etc.), or other known protocols.

The detection system 103 may be representative of or included in an autonomous computing system capable of perceiving, recognizing, and interpreting the significance of objects within its environment to perform an action. For example, the detection system 103 may be representative of or incorporated into an intelligent car having the capability of recognizing a particular driver or passenger inside the car. In further examples, the detection system 103 may be representative of or incorporated into a social robot that can cooperate with humans and/or other robots to perform various tasks, or an autonomous system operating in populated environments. In some embodiments, the detection system 103 may be incorporated in other systems as a component for detecting and recognizing objects. For instance, the detection system 103 may be incorporated into a client device such as a gaming system, television, mobile phone, tablet, laptop, workstation, server, etc. For example, the detection system 103 may be embedded in a machine or computer system for determining if a certain person or persons are present at a particular location and the machine or computer system can turn on/off or execute a particular program if that certain person or persons are present at the particular location.

In some embodiments, the detection system 103 may include a sensor 155, a computation unit 115 that includes a processor 195 and an instance 135a of the detection module, a storage device 197 storing a set of object models 128, and/or an interface 175. As depicted, the sensor 155 is communicatively coupled to the computation unit 115 via signal line 122. The storage device 197 is communicatively coupled to the computation unit 115 via signal line 124. The interface 175 is communicatively coupled to the computation unit 115 via signal line 126. In some embodiments, an instance 135b of the detection module, or various components thereof, can be stored on and executable by the computation server 101, as described elsewhere herein. The instances of the detection module 135a and 135b are also referred to herein individually and/or collectively as the detection module 135.

Although single instances of each of the computation unit 115, sensor 155, storage device 197 and interface 175 are depicted in FIG. 1, it should be recognized that the detection system 103 can include any number of computation units 115, sensors 155, storage devices 197 and/or interfaces 175. Furthermore, it should be appreciated that depending on the configuration the detection system 103 may include other elements not shown in FIG. 1, such as an operating system, programs, various additional sensors, motors, movement assemblies, input/output devices like a speaker, a display device, a transceiver unit and an antenna for wireless communication with other devices (e.g., the computation server 101, other detection systems 103 (not shown), any other appropriate systems (not shown) communicatively coupled to the network 105, etc.).

The sensor 155 may include one or more sensors configured to capture light and other signals from the surrounding environment and to generate and/or process sensor data, such as depth data, therefrom. For instance, the sensor 155 may include a range camera, such as but not limited to an RGB-D camera, a stereo camera, a structured light camera/scanner, time-of-flight camera, interferometer, modulation imager, a laser rangefinder, a light-field camera, an intensified CCD camera, etc., although it should be understood that other types of sensors may be used, such as but not limited to an ultrasound sensor, a color camera, an infrared camera, etc. In some embodiments, the sensor 155 and/or detection system 103 may include a combination of different types of sensors, such as accelerometers, gyroscopes, thermometers, barometers, thermocouples, or other conventional sensing devices. Swiss Ranger sensor by MESA Imaging, Kinect sensor by Microsoft, various stereo vision systems, etc., are further non-limiting examples of cameras that the sensor 155 may include. The sensor 155 may be incorporated into the computation unit 115 or may be a disparate device that is coupled thereto via a wireless or wired connection.

In various embodiments, the sensor 155 may generate and send the depth data describing distance information associated with objects captured by the sensor 155 to the computation unit 115 and/or the computation server 101 for processing, as described elsewhere herein.

The computation unit 115 may include any processor-based computing device, such as the computing device 200 depicted in FIG. 2A. In an embodiment, the computation unit 115 may receive sensor data from the sensor 155, process the sensor data, generate and/or provide results for presentation via the interface 175 based on the processing, trigger various programs based on the processing, control the behavior and/or movement of the detection system 103 or associated systems based on the processing, cooperate with the computation server 101 to process the sensor data, etc., as described elsewhere herein. In some embodiments, the computation unit 115 may store the processed sensor data and/or any results processed therefrom in the storage device 197. The processor 195 and the detection module 135 are described in detail with reference to at least FIGS. 2A-19.

The interface 175 is a device configured to handle communications between the user 125 and the computation unit 115. For example, the interface 175 includes one or more of a screen for displaying detection information to the user 125; a speaker for outputting sound information to the user 125; a microphone for capturing sound and/or voice commands; and any other input/output components facilitating the communications with the user 125. In some embodiments, the interface 175 is configured to transmit an output from the computation unit 115 to the user 125. For example, the interface 175 includes an audio system for playing a voice greeting to the user 125 responsive to the computation unit 115 detecting that the user 125 is within the vicinity. It should be understood that the interface 175 may include other types of devices for providing the functionality described herein.

The user 125 may be a human user. In one embodiment, the user 125 is a driver or a passenger sitting in a vehicle on a road. In another embodiment, the user 125 is a human that interacts with a robot. In a further embodiment, the user is a conventional user of a computing device. The user 125 may interact with, or otherwise provide inputs to and/or receive outputs from, the interface 175, which sends and receives different types of data to and from the computation unit 115.

The storage device 197 includes a non-transitory storage medium that stores data. For example, the storage device 197 includes one or more of a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, flash memory, a hard disk drive, a floppy disk drive, a disk-based memory device (e.g., CD, DVD, Blu-ray™, etc.), a flash memory device, or some other known non-volatile storage device. The storage device 197 may be included in the detection system 103 or in another computing device and/or storage system distinct from but coupled to or accessible by the detection system 103. In some embodiments, the storage device 197 may store data in association with a database management system (DBMS) operable by the detection system 103 and/or the computation server 101. For example, the DBMS could include a structured query language (SQL) DBMS, a NoSQL DBMS, etc. In some instances, the DBMS may store data in multi-dimensional tables comprised of rows and columns, and manipulate, i.e., insert, query, update and/or delete, rows of data using programmatic operations.

The computation server 101 is any computing device having a processor (not pictured) and a computer-readable storage medium (e.g., a memory) (not pictured) to facilitate the detection system 103 in detecting and recognizing objects. In some embodiments, the computation server 101 includes an instance 135b of the detection module. In network-based embodiments, the computation server 101 receives sensor data (e.g., depth data) from the detection system 103, processes the sensor data, and sends any result of processing to the detection system 103.

FIG. 2A is a block diagram of a computing device 200 that includes a detection module 135, a processor 195, a memory 237, a communication unit 245, a sensor 155, and a storage device 197 according to the illustrated embodiment. The components of the computing device 200 are communicatively coupled by a bus 220. In some embodiments, the computing device 200 is representative of the architecture of a detection system 103 and/or a computation server 101.

The memory 237 may store and provide access to data to the other components of the computing device 200. In some implementations, the memory 237 may store instructions and/or data that may be executed by the processor 195. For instance, the memory 237 may store the detection module 135 and/or components thereof. The memory 237 is also capable of storing other instructions and data, including, for example, an operating system, hardware drivers, other software applications, databases, etc. The memory 237 may be coupled to the bus 220 for communication with the processor 195 and the other components of the computing device 200.

The memory 237 includes one or more non-transitory computer-usable (e.g., readable, writeable, etc.) media, which can include an apparatus or device that can contain, store, communicate, propagate or transport instructions, data, computer programs, software, code, routines, etc., for processing by or in connection with the processor 195. In some implementations, the memory 237 may include one or more of volatile memory and non-volatile memory. For example, the memory 237 may include, but is not limited to, one or more of a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, a discrete memory device (e.g., a PROM, FPROM, ROM), a hard disk drive, an optical disk drive (CD, DVD, Blu-ray™, etc.). It should be understood that the memory 237 may be a single device or may include multiple types of devices and configurations.

The communication unit 245 transmits data to and receives data from other computing devices to which it is communicatively coupled using wireless and/or wired connections. The communication unit 245 may include one or more wired interfaces and/or wireless transceivers for sending and receiving data. The communication unit 245 may couple to the network 105 and communicate with other computing nodes, such as the detection system 103 and/or the computation server 101 (depending on the configuration). The communication unit 245 may exchange data with other computing nodes using standard communication methods, such as those discussed above regarding the network 105.

The detection module 135 may be coupled to the sensor 155 to receive sensor data. In some embodiments, the sensor data received from the sensor 155 may include image data such as depth data describing a depth image, data describing a color image, other types of image data, etc. The image data may be an image depicting a scene including one or more objects. An object may be a living or a non-living object, an animate or inanimate object, etc. Example objects include but are not limited to humans, animals, furniture, fixtures, cars, utensils, etc. The detection module 135 can efficiently recognize objects by extracting blobs associated with the objects, segmenting the blobs into layers, generating likelihoods, classifying the objects associated with the blobs using the layers and likelihoods, etc.

In various embodiments, the detection module 135 may be executable to extract one or more blobs representing one or more objects from the depth data of the image data, classify the blobs as describing people or non-people objects, segment the blobs into layers, compare the layers of each blob to a set of object models to determine the identity of the object associated with the blob (e.g., recognize the specific individual associated with a person blob), etc. In various further embodiments, the detection module 135 may be executable to further extract a color image (e.g., an RGB image) from the image data and use a deep convolutional neural network to determine edges and boundaries in the color image, determine blobs in the color image using the edges and boundaries, and generate a classification score for the color image indicating a likelihood that the blob is a person, or in some embodiments a specific individual. Numerous further acts are also possible as discussed further elsewhere herein.

As shown in FIG. 2B, which depicts an example detection module 136, the detection module 136 may include an image processor 202, a layer segmentation module 206, a classification module 208 (the classification module including both a layered classifier module 210 and a CNN module 212), and a fusion module 214, although it should be understood that the detection module 136 may include additional components such as a registration module, a training module, etc., and/or that various components may be combined into a single module or divided into additional modules.

The image processor 202, the layer segmentation module 206, and/or the classification module 208 may be implemented as software, hardware, or a combination of the foregoing. In some implementations, the image processor 202, the layer segmentation module 206, and/or the classification module 208 may be communicatively coupled by the bus 220 and/or the processor 195 to one another and/or the other components of the computing device 200. In some implementations, one or more of the components 202, 206, and/or 208 are sets of instructions executable by the processor 195 to provide their functionality. In further implementations, one or more of the components 202, 206, and/or 208 are stored in the memory 237 and are accessible and executable by the processor 195 to provide their functionality. In any of the foregoing implementations, these components 202, 206, and/or 208 may be adapted for cooperation and communication with the processor 195 and other components of the computing device 200.

The image processor 202 may be communicatively coupled to the sensor 155 to receive sensor data and may process the sensor data to extract image data such as depth data and color data. In some embodiments, the image processor 202 may extract blobs of objects depicted by the image. In some embodiments, the sensor data may include depth image data describing the position of the objects relative to a point of reference. For example, the sensor 155 may include a multi-dimensional depth sensor that generates multi-dimensional (e.g., 3D) data describing a depth image including object(s) captured by the sensor 155. In some embodiments, the sensor data may include color data describing the colors of different pixels in the image, such as an RGB image. The color image data may include RGB values for the pixels forming the object(s) in the image. In some cases, the depth image data may include positional information associated with the object(s), such as a multi-dimensional (e.g., 3D) depth point cloud in form of an array of triplets or spatial coordinates. In some cases, the column and row number of each pixel in the depth image data may represent its X and Y coordinates, and the value of the pixel may represent its Z coordinate.

The image processor 202 may use the depth image data describing the depth images captured by the sensor 155 to determine the discrete object(s) included in the depth images. Using depth imaging can provide various advantages including simplifying object segmentation. In depth images, objects can often be separated from each other in the image by their relative depth. For instance, two adjacent pixels having the same relative distance (as measured from a given point of reference such as the sensor 155 location) are likely to belong to the same object, but two pixels having significantly different distances relative to the point of reference likely belong to different objects in the image. This can be helpful to more easily distinguish freestanding objects from one another.

FIG. 6 demonstrates example object blobs 600 extracted from an example depth image 602. In particular, the depth image 602 depicts a man in a central portion of the frame with another person leaning over a table on the left of the frame and a chair on the right portion of the frame. The image processor 202 processes the data describing the depth image to extract blobs representing the objects, such as blob 604 representing the table, blob 606a representing the person leaning on the table, blob 606b representing the person in center frame, and blob 608 representing the chair.

In some embodiments, to extract blobs from a depth image, the image processor 202 may estimate positional data for the pixels in the depth image using the distance data associated with those pixels and may apply a flood fill algorithm to adjacent/connected pixels with corresponding depths to determine blobs for the object(s) formed by the pixels. In some embodiments, the image processor 202 may filter out any pixels in the depth image that do not have any distance data associated with them prior to estimating the position data to reduce the amount of processing that is required. Additionally or alternatively, after determining the blobs, the image processor 202 may filter out blobs that are smaller than a certain size (e.g., 500 pixels) to eliminate processing of blobs that are likely to be inconsequential.

In some further blob extraction embodiments, using the focal length of the camera (e.g., sensor 155) and the depth image, the image processor 202 may extract Z, X, and Y coordinates (e.g., measured in feet, meters, etc.). The image processor 202 may then filter the depth image. In particular, the image processor 202 may remove all pixels with depth values that do not fall within a certain range (e.g., 0.5-6.0 meters). Assuming a point of reference (e.g., a planar surface on which the camera is mounted that is parallel to the floor), the image processor 202 may estimate the X and Y coordinates of each pixel using the following formulas:

${x( {{row},{col}} )} = \frac{( {{col}_{center} - {col}} )*{Z( {{row},{col}} )}}{focal\_ length}$${{y( {{row},{col}} )} = \frac{( {{row}_{center} - {row}} )*{Z( {{row},{col}} )}}{focal\_ length}},$

where focal_length is the camera focal length. The image processor 202 may then remove certain pixels based on their X, Y location.

The image processor 202 may then apply a flood fill algorithm to the pixels of the depth image connected together within a certain depth threshold (e.g., 5 cm) to produce blobs describing one or more object(s) in the depth image. The image processor 202 can remove blobs that are smaller than a certain size (e.g., 500 pixels). The remaining blobs may then be analyzed (e.g., classified, segmented into layers, and matched to object models), as discussed in further detail herein.
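For illustration only, the following Python sketch shows one way the blob extraction steps above (range filtering, metric coordinate recovery, flood fill, and size filtering) could be implemented. The function and parameter names (extract_blobs, depth_thresh, min_blob_px) are hypothetical, and a breadth-first flood fill is only one workable realization of the described algorithm.

```python
import numpy as np
from collections import deque

def extract_blobs(depth, focal_length, min_depth=0.5, max_depth=6.0,
                  depth_thresh=0.05, min_blob_px=500):
    """Segment a depth image (in meters) into blobs of connected pixels.

    Neighboring pixels join the same blob when their depths differ by
    less than depth_thresh (the 5 cm threshold above); blobs smaller
    than min_blob_px (500 pixels above) are discarded.
    """
    rows, cols = depth.shape
    # Drop pixels with no depth data or outside the working range.
    valid = np.isfinite(depth) & (depth >= min_depth) & (depth <= max_depth)

    # Recover metric X/Y coordinates via the pinhole formulas above,
    # taking the image center as the point of reference.
    r, c = np.mgrid[0:rows, 0:cols]
    x = (cols / 2.0 - c) * depth / focal_length
    y = (rows / 2.0 - r) * depth / focal_length

    labels = np.zeros((rows, cols), dtype=np.int32)
    kept, next_label = [], 0
    for sr, sc in zip(*np.nonzero(valid)):
        if labels[sr, sc]:
            continue                                   # already assigned
        next_label += 1
        queue, blob = deque([(sr, sc)]), [(sr, sc)]
        labels[sr, sc] = next_label
        while queue:                                   # flood fill
            pr, pc = queue.popleft()
            for nr, nc in ((pr - 1, pc), (pr + 1, pc),
                           (pr, pc - 1), (pr, pc + 1)):
                if (0 <= nr < rows and 0 <= nc < cols and valid[nr, nc]
                        and not labels[nr, nc]
                        and abs(depth[nr, nc] - depth[pr, pc]) < depth_thresh):
                    labels[nr, nc] = next_label
                    queue.append((nr, nc))
                    blob.append((nr, nc))
        if len(blob) >= min_blob_px:
            kept.append(next_label)
        else:
            for pr, pc in blob:                        # erase small blobs
                labels[pr, pc] = 0
    return labels, kept, x, y
```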

In some embodiments, the image processor 202 may pre-classify the blobs based on their shape to reduce the number of blobs that ultimately need to be segmented and classified. This can reduce processing time when certain types of blobs may not need to be processed. For instance, in some cases only people blobs may need to be recognized by the detection module 135, and the image processor 202 may discard any blobs associated with non-people objects. The image processor 202 may pre-classify objects based on the overall shapes of their blobs. For instance, people blobs generally have certain unique human characteristics that distinguish them from non-person blobs, such as head and shoulders regions, leg regions, arm regions, torso regions, etc. The image processor 202 may analyze the blobs to determine whether their outlines/shapes include one or more of these human characteristics, and if so, may classify them as person object types.

The layer segmentation module 206 may be coupled to the image processor 202, the memory 237, the communication unit 245, or another component to receive data describing one or more blobs detected from the sensor data, may segment each blob into a set of layers, and may calculate one or more properties associated with each layer and generate a mask image of the data to provide to the classification module 208. The mask image data may be data in which different pixels in the segments are grouped together (e.g., blobs) and labeled. A particular object, such as a certain person, can be uniquely described by the collection of the layers (e.g., horizontal slices). The segmentation performed by the layer segmentation module 206 is also referred to herein in various places as slicing or transforming. The set of layers may be a series of contiguous segments extending from one side of the blob to another. The set of layers may represent a sampling of various portions of the blob. This provides the benefit of maintaining a highly accurate recognition rate while increasing computational efficiency by not processing the entire blob. In some embodiments, the segments are substantially parallel and have a predetermined thickness. The spacing between segments may be uniform, non-uniform, or random in nature depending on the embodiment. In some embodiments, the layers may be horizontal layers. In other embodiments, the layers may be vertical or diagonal layers. Also, while the layers are depicted as being substantially flat, it should be understood that layers that are non-flat may also be used.

As a further example, for a person blob, the layer segmentation module 206 may segment the blob in locations estimated to be most relevant. For instance, a person blob may be segmented in locations corresponding to various notable body parts, as shown in FIG. 7 and described in further detail below. The layer segmentation module 206 may in some cases filter out layers having a length that does not meet a certain minimum length (e.g., five pixels in length) and may process each of the remaining layers/segments to determine geometric properties for that layer. In some embodiments, the layer segmentation module 206 may base the segmentation scheme applied on the object type pre-classified by the image processor 202. The object type may be received from the image processor 202, the memory 237, the communication unit 245, or another component.
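Continuing the sketch above under the same assumptions, the slicing step might look as follows; uniform row spacing is just one of the spacing options described above, and the five-pixel minimum mirrors the length filter just mentioned.

```python
import numpy as np

def slice_blob(labels, blob_id, spacing=10, min_len=5):
    """Slice one blob from the label image into horizontal layers.

    Every `spacing` rows, the blob's pixels in that row form one layer;
    layers shorter than min_len pixels are filtered out.
    """
    rows = np.nonzero((labels == blob_id).any(axis=1))[0]
    layers = []
    for row in range(rows.min(), rows.max() + 1, spacing):
        cols = np.nonzero(labels[row] == blob_id)[0]
        if cols.size >= min_len:
            layers.append((row, cols))   # one horizontal slice
    return layers
```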

FIG. 7 is a diagram showing an example segmentation of a person blob 700 into a plurality of layers 701. In some embodiments, the layer segmentation module 206 may segment the person blob 700 in regions that correspond to notable body parts. For instance, the set of layers 701 determined by the layer segmentation module 206 may include one or more of a head layer 702, a face layer 704, a neck layer 706, a chest layer 708, an arm layer 710, a stomach layer 712, a pelvis layer 714, a thigh layer 716, a knee layer 718, and a foot layer 720. For each of the layers 702 . . . 720, the layer segmentation module 206 may determine one or more properties. In some embodiments, a property may describe the size, curvature, a curve fit, a shape fit, etc., associated with the layer, as described elsewhere herein.

The artificial environments that people live in, whether inside or outside in the city, include many more surfaces that are flat or at least much less curved than people. People, by contrast, are generally curved all over. For instance, people's heads, arms, legs, chests, etc., generally all have some curvature to them. As a result, even if only part of a person is visible, that part is likely to have some curvature. Accordingly, the layer segmentation module 206 can process the layers segmented by it from a given blob to determine their unique curvature properties, which are then used by the classification module 208 to identify the blob. As few as six properties in some cases may accurately characterize curvature associated with a blob at close range.

In some embodiments, the layer segmentation module 206 may determine layer curvature by fitting a line and a parabola to the data points associated with the layer in the X and Z dimensions. FIG. 9 depicts example layers extracted from an upright person blob 902 processed from a depth image. As shown, the parabolas 903 and 905 in the head slice graph 900 and the torso slice graph 904, respectively, show significant curvature. Unlike the head slice graph 900 and the torso slice graph 904, the counter slice graph 906, which is based on a layer taken from a blob associated with a counter (not portrayed), includes a parabola 907 that is substantially flat in nature and fits closely to the line. In some implementations, the layer segmentation module 206 may use a polyfit algorithm extended to double floating-point precision to find the best line (L) and parabola (P) equations describing the data, although other suitable polynomial fitting algorithms may also be used.

The layer segmentation module 206 may then use the L and P equations associated with each layer to determine a set of geometric properties associated with that layer. In some embodiments, the set of geometric properties may be represented as a six dimensional vector including the following elements (an illustrative computation is sketched after the list):

1. Δc, which is the line depth at the center of the segment minus the center parabola depth, as a measure of concavity: L(x_μ)−P(x_μ);
2. RMSE_L, which is the root mean squared error of the fitted line equation;
3. σ_L, which is the standard deviation of the fitted line equation;
4. RMSE_P, which is the root mean squared error of the fitted parabola equation;
5. σ_P, which is the standard deviation of the fitted parabola equation; and
6. k, which is an estimation of curvature.
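As referenced above, a minimal numpy sketch of this computation follows. The disclosure does not spell out the curvature estimator k, so the use of the fitted parabola's second derivative here is an assumption.

```python
import numpy as np

def layer_features(x, z):
    """Fit a line and a parabola to one layer's (X, Z) points and build
    the six-element geometric feature vector listed above."""
    x, z = np.asarray(x, dtype=float), np.asarray(z, dtype=float)
    line = np.polyfit(x, z, 1)            # L(x) = b*x + c
    parab = np.polyfit(x, z, 2)           # P(x) = a*x^2 + b*x + c
    l_res = z - np.polyval(line, x)       # line-fit residuals
    p_res = z - np.polyval(parab, x)      # parabola-fit residuals
    x_mu = x.mean()
    delta_c = np.polyval(line, x_mu) - np.polyval(parab, x_mu)
    return np.array([
        delta_c,                          # concavity: L(x_mu) - P(x_mu)
        np.sqrt(np.mean(l_res ** 2)),     # RMSE_L
        np.std(l_res),                    # sigma_L
        np.sqrt(np.mean(p_res ** 2)),     # RMSE_P
        np.std(p_res),                    # sigma_P
        2.0 * parab[0],                   # k: curvature estimate (assumption)
    ])
```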

The classification module 208 in some embodiments includes a layered classifier module 210 and/or a CNN module 212. The classification module 208 may uniquely identify the objects in the depth image based on the set of layers. In some embodiments, the classification module 208 may compare the layers associated with each object to a set of stored object models to determine a match. For instance, to determine a matching object model, the classification module 208 may compare the geometric properties of the layers to each of the models. As a further example, using one or more curvatures associated with one or more horizontal slices of a blob, the classification module 208 can determine which person or other object the blob corresponds to. Example types of models compiled from previous information include, but are not limited to, image galleries, Gaussian mixture models, hidden Markov models, and support vector machines.

In some embodiments, the classification module 208 may calculate, for each object model 128 stored in the storage device 197, a combined value representing the output of all the layers of the set, which represents the likelihood of the detected person belonging to that object model. The combined value may represent a recognition score used for identifying a particular individual or object.

The classification module 208 may be coupled to the layer segmentation module 206, the memory 237, the communication unit 245, and/or another component to receive the set(s) of segmented layers associated with a given depth image and the geometric properties associated with each of the layers. The classification module 208 may be coupled to the storage device 197 to retrieve the object models 128. The object models 128 may represent objects that have been trained, registered, and/or otherwise pre-determined, and are detectable and recognizable by the detection module 135. The object models may be manually input, for instance, by an applicable stakeholder such as the user, an administrator, etc., and/or may be machine learned using various machine learning techniques, such as a probabilistic graphical model (e.g., a Gaussian mixture model). In some embodiments, Gaussian mixture models (GMMs) with various numbers of mixtures (e.g., 50+) can be trained using manually classified objects from various depth images.

The layered classifier module 210 may perform many of the acts or functions described above concerning the classification module 208. Specifically, the layered classifier module 210 may receive the image data, such as a depth image and a mask image, and determine the likelihood scores for each segmented object in the image. FIG. 15 displays an example process 1500 of determining the likelihood score using the layered classifier module 210. At 1502, the image data is segmented by the layer segmentation module 206 as shown in more detail in FIG. 7. Each segment 1508 may have a plurality of geometric properties (e.g., a six dimensional vector) calculated as described with reference to the layer segmentation module 206. At 1504, the layered classifier module 210 may use the information provided in the segments to classify each segment against a person GMM or an imposter GMM. At 1506, the layered classifier module 210 may sum the log likelihood scores for persons (P_i) and imposters (I_i) for each segment to generate a likelihood score for a portion of the depth image.

The CNN module 212 may receive image data, such as a color image and/or a depth image, from the image processor. The color image may be RGB color based, CMYK color based, or another suitable type of color processing performed by the image processor. The CNN module 212 may also receive a mask image from the layer segmentation module 206. The mask image may include different pixel groups (blobs) in the segments. The CNN module 212 may construct an image, such as a separate image or modified version of the color image, by copying the pixels from the color image, the depth image, a combination of both images, or another suitable image type, the copied pixels having locations corresponding to the pixel group areas (component areas) of the mask image.
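A minimal sketch of this construction step, assuming the mask is a label image from the segmentation described above and that pixels outside the component are filled with a constant background (the fill value is an assumption):

```python
import numpy as np

def masked_color_image(color, mask, blob_id, fill=0):
    """Copy color pixels only where the mask marks the given component.

    color is an HxWx3 array and mask an HxW label image; everything
    outside the component area becomes the fill value.
    """
    out = np.full_like(color, fill)
    sel = mask == blob_id
    out[sel] = color[sel]       # copy pixels inside the component area
    return out
```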

The CNN module 212 in some embodiments uses a deep CNN network. A non-limiting example of a deep CNN network is AlexNet, which was originally developed by Krizhevsky. The CNN module 212 can use a deep learning architecture, such as but not limited to Caffe, to train and evaluate the neural network for future processing.

In some embodiments, the CNN module 212 may have a network architecture that includes convolutional layers, pooling layers, and fully connected layers. For example and not limitation, the CNN may include five convolutional layers, three pooling layers, and three fully connected layers.

In some embodiments, the layers are optimized to produce a certain result. The CNN module 212 may be configured to generate the likelihood of whether an image or portion thereof (such as the constructed image discussed herein) depicts a person. As a further example, the final layer of the convolutional neural network in the CNN module 212 may comprise a certain layer type configured to generate a likelihood that an object represents a person. One non-limiting example of this is a soft max layer, which allows the AlexNet classifier to generate a likelihood that the constructed image being classified is that of a person.
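For illustration, the sketch below shows an AlexNet-style network of the shape described above (five convolutional layers, three pooling layers, three fully connected layers) ending in a soft max layer. The two-class person/not-person head is an assumption based on the usage described here, and PyTorch is used purely for illustration; the disclosure names Caffe as one possible training framework.

```python
import torch
import torch.nn as nn

class PersonNet(nn.Module):
    """AlexNet-style CNN: 5 conv layers, 3 max-pooling layers, and
    3 fully connected layers, for 3x227x227 inputs."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, 2),   # final layer: person / not-person (assumed)
        )

    def forward(self, x):
        # The softmax converts the final layer's outputs into a
        # likelihood that the constructed image depicts a person.
        return torch.softmax(self.classifier(self.features(x)), dim=1)
```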

The CNN module 212 may classify numerous different categories of objects in an image database. For example, in a non-limiting embodiment, AlexNet may be designed to classify one thousand different categories of objects in the ImageNet database. It should be understood that other suitable and/or compatible neural network architectures may be implemented by the CNN module 212.

The fusion module 214 may receive the likelihood scores from the layered classifier module 210 and the CNN module 212 and calculate an overall likelihood score using the two different classifications. One example embodiment of the score-level fusion performed by the fusion module 214 includes a binary classifier for detecting people. The score-level fusion method converts the likelihoods to log likelihoods and uses a weighted summation to calculate an overall likelihood score, for example:

$C_{obj} = k_{1}L_{obj} + k_{CNN}CNN_{obj}.$

Summation of log-likelihoods is one possible method of combining the two scores; other variations for calculating an overall likelihood score aside from summation of log-likelihoods are also possible.
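A minimal sketch of this score-level fusion, assuming unit weights and an illustrative threshold, both of which would be tuned empirically:

```python
import math

THRESHOLD = -1.0   # illustrative only; validated empirically in practice

def fuse_scores(p_layered, p_cnn, k1=1.0, k_cnn=1.0):
    """Weighted sum of log likelihoods: C_obj = k1*L_obj + k_cnn*CNN_obj."""
    eps = 1e-12                          # guard against log(0)
    l_obj = math.log(max(p_layered, eps))
    cnn_obj = math.log(max(p_cnn, eps))
    return k1 * l_obj + k_cnn * cnn_obj

# Example: classify as a person when the fused score meets the threshold.
is_person = fuse_scores(0.8, 0.9) > THRESHOLD   # True for these inputs
```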

In some instances, the detection module 135 may include a registration module and/or training module (not shown) for registering and/or training new objects with the detection module 135. During registration, the registration module may capture one or more depth images of the object and the training module may generate or update an object model that describes the various properties of the object including, for instance, the curvature of the object. A user registering the object may optimize the object model by entering and/or adjusting automatically determined information (e.g., curvature information) about the object via an associated user interface, such as entering a unique name for the object (e.g., a person's name), categorizing the object, or entering attributes about the object (e.g., size, weight, color, etc.). In some instances, the object models may be updated regularly with the most current, reliable models. For instance, the detection systems 103 and/or users may upload new object models to the computation server and the computation server may push the models to the various other detection systems 103 that are coupled to the network 105 for use by the detection systems 103.

In an embodiment that utilizes Gaussian mixture models to classify a given blob, the classification module 208 may determine the likelihood of a vector belonging to a predetermined Gaussian mixture person model, M, using the equation:

${{p( {\overset{harpoonup}{x},M} )} = {\sum\limits_{i = 1}^{50}\; {\frac{P_{i}}{\sqrt{( {2\pi} )^{6}{\sigma_{i}}}}^{{- {({\overset{harpoonup}{x} - \overset{harpoonup}{\mu}})}^{T}}{\sigma_{i}^{- 1}{({\overset{harpoonup}{x} - \overset{harpoonup}{\mu}})}}}}}},{{{where}\mspace{14mu}\lbrack {\sigma_{i},P_{i},{\overset{harpoonup}{\mu}}_{i}} \rbrack} \in M_{i}}$

The log-likelihood of a new segment, $\vec{v}$, belonging to a given object model, OM, may be determined by the following equation:

$O(\vec{v}) = \log(p(\vec{v}, OM)) - \log(p(\vec{v}, \text{not\_OM})),$

where the object model, OM, may represent one or more generic or specific people.

Given N segments in an object sequence, S, each with its own likelihood, the classification module 208 may classify the blob/object by summing up the log likelihoods and applying a predetermined threshold. A maximum cumulative score can be a reliable indicator of whether or not the set of layers/sequences corresponds to a given object model. For instance, if the score satisfies (e.g., is greater than) a predetermined threshold that has been verified as an effective minimum likelihood that the object matches the model, then the object can be classified as corresponding to the person. This helps in the case where incorrectly classified segments might cluster together to negatively affect proper classification. For instance, a person blob may include layers associated with an arm held out parallel to the ground, a dress, and/or an object being carried, etc., that can cluster together negatively to bring down the likelihood score that the object is associated with a particular object model to which it in fact corresponds. In some cases, if a given blob terminates after such a negative cluster of layers, the object could potentially be incorrectly classified by the classification module 208, in which case the classification module 208 may consider the cumulative sum of the log likelihood score.
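A sketch of this cumulative log-likelihood classification, assuming person and imposter models exposing an interface like sklearn's GaussianMixture.score_samples (per-sample log densities); the trained 50-mixture models themselves would be built offline:

```python
import numpy as np

def blob_log_likelihood(segments, person_gmm, imposter_gmm):
    """Sum O(v) = log p(v, OM) - log p(v, not_OM) over all N segments.

    segments is a list of six-element feature vectors, one per layer
    (e.g., from layer_features above).
    """
    X = np.asarray(segments)
    per_segment = person_gmm.score_samples(X) - imposter_gmm.score_samples(X)
    return per_segment.sum()

# A blob is classified as a person when the cumulative score satisfies a
# predetermined threshold:
# is_person = blob_log_likelihood(segs, person_gmm, imposter_gmm) > THRESHOLD
```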

One significant advantage of the novel layer-based approach discussed herein is that it is tolerant to occlusions. For instance, even if only a portion of the object is visible in the depth image and, as a result, some layers could not be extracted from the blob (e.g., from the top, bottom, right, left, and/or the middle of the object) and are therefore missing, the object (e.g., person) can in many cases still be recognized by the classification module 208 because the object is modeled in its corresponding object model as a collection/set of layers. For instance, for person recognition, the eyes, nose, or top of head, which are commonly required by other face and head recognition approaches to align the image with the model, do not have to be visible in the depth image for the classification module 208 to accurately recognize the person. Additional advantages of the layer-based recognition approach described herein relative to other approaches are summarized in the table 800 in FIG. 8.

FIG. 3 is a flowchart of an example method 300 for detecting and recognizing image objects. In block 302, the image processor 202 may determine a depth image. In some embodiments, the image processor 202 may determine a depth image by receiving the depth image from the sensor 155 (e.g., a stereo camera, a structured light camera, a time-of-flight camera, etc.). In block 304, the image processor 202 may detect an object blob in the depth image; in block 306, the layer segmentation module 206 may segment the object blob into a set of layers; and in block 308, the classification module 208 may compare the set of layers associated with the object blob with a set of object models to determine a match.

In some embodiments, in association with determining the set of layers, the layer segmentation module 206 may determine a curvature associated with the set of layers and the classification module 208 may evaluate the curvature using the object models when comparing the set of layers with the object models to determine the match. Further, in some embodiments, the classification module 208 may compare the set of layers with the set of object models by determining a likelihood of the object blob as belonging to each of the object models and determining the object blob to match a particular object model based on the likelihood.

Next, in block 310, the classification module 208 may recognize the object associated with the object blob based on the match. For instance, the classification module 208 may determine the identity of the object (e.g., by receiving from the storage device 197 identifying information for the object that is stored in association with the matching object model 128). In response to identifying the object, the detection module 135 may trigger the operation of a program that performs operations based on the identity of the object, such as retrieval of information associated with the object, control of one or more output devices (e.g., displays, speakers, sensors, motivators, etc.) to interact with the object (e.g., greeting a user using the user's name), pulling up account information associated with the object (e.g., a specific person/user), etc.

FIGS. 4A and 4B are flowcharts of a further example method 400 for detecting and recognizing image objects. In block 402, the image processor 202 extracts a depth image from the sensor 155 and then extracts one or more blobs from the depth image in block 404. In some instances, the image processor 202 may classify the extracted blobs as a human/person or other object types (e.g., animal, furniture, vehicle, etc.). For instance, the image processor 202 may detect a plurality of blobs associated with a plurality of objects depicted by the depth image, and may classify each of those blobs as a person or other type of object based on a shape of the blob as shown in block 406. In some embodiments, if a given blob is not classified into a type that meets one or more blob type requirements, the image processor 202 may discard that blob from further processing (e.g., layer extraction, blob recognition, etc.). For instance, as shown in block 408, if none of the blobs reflect people, the method may return to the beginning and repeat until a person blob is found. In other embodiments, the method 400 may skip the classification and filtering operations in blocks 406 and 408.

Next, in block 409, for each blob provided by the image processor 202, the layer segmentation module 206 may transform the blob into a set of layers and extract one or more geometric properties for each layer of the set. Then, in block 412, the classification module 208 may compare the one or more geometric properties associated with each layer of the set of layers with one or more object models from the storage device 197. As discussed elsewhere herein, the one or more geometric properties may reflect one or more of a size, curvature, curve fit, and shape fit of that layer. For instance, the one or more geometric properties include a multi-dimensional vector (e.g., 6D) containing properties associated with layer curvature, as discussed elsewhere herein.
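
As a rough sketch of what such a per-layer feature vector could look like, the following code builds a 6-D vector from the (x, z) points of one horizontal layer: width, depth extent, point count, and the coefficients of a quadratic curve fit as a stand-in for the curvature and curve-fit properties. The specific six properties chosen here are illustrative assumptions; the disclosure only states that the vector reflects size, curvature, curve fit, and shape fit.

```python
import numpy as np

def layer_feature_vector(layer_points):
    """Build an illustrative 6-D geometric property vector for one horizontal
    layer of a blob from its (x, z) points in the depth image plane."""
    pts = np.asarray(layer_points, dtype=float)
    x, z = pts[:, 0], pts[:, 1]
    a, b, c = np.polyfit(x, z, deg=2)  # quadratic curve fit (curvature proxy)
    return np.array([
        x.max() - x.min(),             # layer width
        z.max() - z.min(),             # layer depth extent
        float(len(pts)),               # number of points in the layer
        a, b, c,                       # curve-fit coefficients
    ])
```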

In some embodiments, the method 400 may continue with the classification module 208 computing 414, for each object model, a recognition score for each layer of the set of layers and determining a likelihood (e.g., a value) that the blob belongs to the object model by aggregating 416 (e.g., summing) the layer recognition scores. The method may proceed to compare the set of layers of each blob to find the best match as shown in block 408, and based on all of the likelihood values, the classification module 208 may recognize the blob by determining the object model the blob belongs to. For instance, the object blob may be classified as a person blob and the classification module 208 may recognize the person associated with the person blob based on the matching object model. In some cases, the object model associated with the highest likelihood may be determined to be the matching object model. This determination may in some cases be dependent on the likelihood value satisfying a minimum likelihood threshold.

FIG. 5 is a diagram of an example method 500 for detecting people blobs, slicing the people blobs into layers, and comparing the layers to existing user models to recognize the people associated with the people blobs. In essence, FIG. 5 describes the flow of information from the depth image to a recognition score. In block 502, a depth image is extracted from the sensor (e.g., stereo camera, structured light camera, time-of-flight camera, etc.). In block 504, blob(s) are extracted from the image and classified as a person or other type of object. In block 506, each blob that is classified as a particular object type (in this case, a person) is sliced into horizontal layers/slices and one or more geometric properties are computed for each layer. In block 508, for each blob, the features from select or all layers are compared to existing (trained, untrained, etc.) object models to determine if the person blob belongs to a particular model.

FIG. 11 illustrates an example application of the detection technology. As shown, the detection technology can provide a navigational aid for blind individuals. For instance, a robot equipped with the detection module 135 can help vision-impaired people not only move through difficult and crowded environments, but also assist in describing the world around them. The robot companion can detect, recognize, and track its human partner as well as detect and recognize other people in the surroundings, and can do so while actively moving, navigating around obstacles, and recalculating paths to a goal. It can also work inside and outside, under all sorts of lighting conditions, and at many different ranges to the person it is trying to track. FIG. 11 in particular depicts an example blind-aid robot 1102 leading a human partner 1100 through an indoor office environment (e.g., around detected obstacles such as a chair or a table). The robot is configured to track people and investigate the environment. For instance, the robot can detect the table and chair and inform the human partner of their existence and location should the human partner wish to sit down. In some embodiments, the robot could be configured to constantly communicate the direction of motion to the human via a physical communication medium, such as a tactile belt around the individual's waist. The direction of motion 1108 may be computed by the robot based on the person's detected location.

By way of further illustration, in the example scenario depicted in FIG. 11, envision that the human partner 1100 began behind the detection system 103 (a robot) and followed it around a curve 1108 that passed between a large pillar (on the right, not depicted) and a waist-high countertop 1104. On average, the human partner 1100 maintained approximately a 1.5 m distance from the robot. At times, this distance increased to as much as 3 m when the robot had to get around the curve and out of the way. To track the person, the sensor 155 was mounted at chest level facing to the rear. This sensor recorded RGB and depth images at 30 Hz. The first 1000 images captured had the following characteristics:

-   890 images containing part of a person;
-   719 images with visible shoulders and/or eyes;
-   171 images with only a partially visible person missing eyes and at least one shoulder (e.g., off the side of the image); and
-   331 images with 2 people visible or partially visible.

Examples of these images are depicted in FIGS. 12A, 12B, and 12C. In particular, FIG. 12A depicts a common close-range view of the human partner. The head and legs are missing because the sensor 155 was not wide enough in angle to capture everything at that close range. The image in FIG. 12B was also very common. In addition to the human partner's face being partially off screen, only one shoulder is visible as the human partner strayed to the side of the path, which is also representative of being blocked by various other vertical obstructions, like pillars, wall corners, and doors. The image in FIG. 12C depicts two people in frame.

The following table shows example statistics collected during the example robot guidance scenario.

Performance of the detection technology on the guidance scenario dataset (Polyfit, GMM):

    False Positive:                 14/1000 (0.1%)
    False Negative:                 0/719 (0%)
    False Negative (Partial Body):  43/171 (25.1%)
    False Negative (2nd Person):    195/331 (58.9%)
    Speed:                          >30 Hz

In this scenario, the human partner was detected in every full frontal frame, and was missed in only 25% of the blobs with horizontal occlusions. A second person that also appeared in frame was detected 41% of the time, without a significant increase in false positive rate. The training data used for this scenario is described below.

During this scenario, another people detection algorithm was evaluated side-by-side with the detection technology on two larger data sets: 1) moving people and/or robot, evaluated with a Microsoft Kinect and a stereo vision system; and 2) people at different relative rotations, evaluated in two locations: a) inside; and b) in direct sun with the stereo vision system.

For 1), a total of 14,375 images were collected using the rear-facing Kinect sensor on the robot. They were collected over 7 different runs ranging in duration from two minutes to 15 minutes. The two shortest runs involved simply moving around in front of the camera. The remaining runs involved the scenario where a person followed the robot through an environment. All but one run contained at least two people. In order to train the detection system, a human trainer manually input examples of people and other objects in the image set. Objects were tracked between successive frames by tabulating a blob similarity score to limit the required input from the trainer. 5736 positive examples and 9197 negative examples were identified in this fashion using the Kinect dataset.

Additional data was collected using a stereo image system (Point Grey XB3) mounted in place of the Kinect. For computational efficiency, block matching with an 11×11 pixel window size was utilized to identify disparity between frames. In addition to testing the robot indoors, the robot was also taken outside in both sunlight and shady areas. A total of 7778 stereo images were collected over five trials. 5181 positive examples and 2273 negative examples were collected in this manner. Because of the increased noise in the stereo data, fewer objects crossed an example minimum threshold of 500 pixels to be considered for people detection.
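
For reference, block matching disparity with an 11×11 correlation window can be reproduced with OpenCV's StereoBM, as sketched below. This is an illustrative stand-in, not the implementation used above; the file names and the numDisparities setting are assumptions.

```python
import cv2

# Illustrative stereo block matching with an 11x11 window, producing a
# disparity map from a rectified left/right image pair.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoBM_create(numDisparities=64, blockSize=11)
disparity = stereo.compute(left, right)  # fixed-point disparity map (scaled by 16)
```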

In the above scenario, the detection technology decreased the false negative rate to 0% with horizontal occlusions, without increasing the false negative rate for vertical occlusions or decreasing the speed. In more difficult, larger scenarios with more than one moving person, a performance improvement of more than 5% was achieved. This performance difference was demonstrated with various sensors both inside and outside. The addition of depth-based blob tracking across sequential frames can even further improve the percentage of time people are detected by the robot.

FIGS. 10A and 10B are graphs showing an example comparison between two different types of sensors. In particular, FIG. 10A illustrates the row-level accuracy for people detection using the Microsoft Kinect and FIG. 10B illustrates the row-level accuracy for people detection using the stereo vision system. FIG. 10A includes performance estimates for three different algorithms: (1) the Spinello algorithm; (2) the set of properties with a linear classifier (Polyfit—Linear); and (3) the set of properties with a GMM (Polyfit—GMM). FIG. 10B includes performance estimates for two different algorithms: (1) the Spinello algorithm; and (2) the set of properties with a GMM (Polyfit—GMM).

With reference to FIG. 10A, when examined on a segment-by-segment, or row-level, basis, there is little difference in ROC between the Polyfit—Linear curve and the Spinello curve. Using the novel, smaller set of geometric properties (e.g., the 6D vector) is comparable without the additional computational overhead. However, with the addition of the GMM to the new features, the detection technology described herein performs significantly better and, at an estimated 3% false positive rate, provides a 3.6% improvement in the true positive rate. Over the entire ROC curve, it provides a 2.5% increase in the area.

With reference to FIG. 10B, there is less difference between feature sets, which may be due, in part, to the filtering inherent to the block matching disparity calculations, which can round edges and widen holes during blob extraction. However, FIG. 10C is a graph showing a blob-level comparison between the novel technology described herein and another alternative. In this figure, a blob-level comparison of ROC curves for Polyfit—GMM and the Spinello algorithm demonstrates a more pronounced performance improvement with the stereo vision system using the detection technology described herein. In particular, the use of Polyfit—GMM with the set of geometric properties bumps up performance in the critical region from 0-10% false positive rate. At a 3% false positive rate, the new algorithm achieves an 83.4% true positive rate vs. the Spinello algorithm's 77.8%.

In the above example scenario, the people objects were moving, or the robot was moving, or both were moving through the environment during recognition, and as a result, a majority of the people detected were facing the camera and were generally occluded to some degree. In a second example scenario, the following set of example data demonstrates the effectiveness of people detection at different relative orientations to the camera, which is generally an important aspect of human robot interaction.

In this second scenario, a group of 29 different people stood and rotated in place in front of the sensor 155 while being captured. 20 people (15 men and 5 women) in the group were evaluated in an interior room with no windows and 24 people (17 men and 7 women) in the group were evaluated in front of floor-to-ceiling windows with full sun. 14 people participated in both environments. The sensor 155 used for this experiment included the stereo vision system. Detection models were trained using the data set from the guidance scenario described herein, in which a total of six people, all male, were present in the training data set. The following table summarizes the results of the second scenario.

False negative rates for interior room vs. window, for each gender:

                      Male               Female             Overall
    Interior Room     2/1893 (0.1%)      82/533 (15.4%)     3.5%
    Next to Window    107/2208 (3.8%)    193/764 (25.3%)    10.5%

While there are differences between detection rates for the two types of lighting conditions, which is likely due to the effects of full sun on stereo disparity calculations, the detection technology correctly identified the male persons more than 95% of the time (e.g., 99.9% in the interior room and 95.2% near the window). Because there were no women in the training data, the rates for correctly identifying women were lower. However, even without training, 84.6% of the women were correctly identified in the interior room and 74.7% of the women were correctly identified when next to the window. The false negative identifications were dominated by women of slight build and/or long hair (i.e., with less curvature), which could be improved by broadening the training set.

FIG. 13 is a flow chart of an example method 1300 of recognizing image objects. At 1302, the image processor 202 receives image data. The image data may include various images and information captured by the sensors 155. For example, in one embodiment, a robot may be moving through an office space and a camera sensor 155 may be capturing a depth image and a color image as the robot moves through the space. At 1304, the layer segmentation module 206 creates mask images of the image data by segmenting the image data into a plurality of components as described above with reference to FIGS. 5, 7, and 15, for example.
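
The segmentation itself is described with reference to the other figures; purely as a rough illustration of producing a component mask from a depth image, the sketch below thresholds the depth values and labels connected components with OpenCV. The range cutoff, function name, and use of connected components are assumptions for the example, not this document's segmentation method.

```python
import cv2
import numpy as np

def create_mask_image(depth_image, max_range_mm=4000):
    """Produce a mask image labeling connected components (candidate blobs)
    in a depth image; pixels outside the range of interest are zeroed."""
    valid = ((depth_image > 0) & (depth_image < max_range_mm)).astype(np.uint8)
    num_labels, mask = cv2.connectedComponents(valid)  # label 0 = background
    return mask, num_labels

# Example with a synthetic 16-bit depth frame.
depth = np.random.randint(0, 6000, size=(480, 640), dtype=np.uint16)
mask, num_components = create_mask_image(depth)
```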

At 1306, the layered classifier module 210 may determine a likelihood score from the image data received from the image processor 202 and the mask image data received from the layer segmentation module 206. The likelihood score calculated by the layered classifier module 210 may be based on both the depth image and the mask image. At 1308, the CNN module 212 may determine a likelihood score from the image data received from the image processor 202 and the mask image data received from the layer segmentation module 206. The likelihood score calculated by the CNN module 212 may be based on both the color image and the mask image, the depth image and the mask image, a combination of both, etc. In some embodiments, the CNN module 212 may generate an object image by copying pixels from the first image of the components in the mask image and classify the object image using the deep convolutional neural network within the CNN module 212. At 1310, the fusion module 214 may determine a class (e.g., person class or imposter class) of at least a portion of the image data received by the image processor 202, based on the likelihood score from the layered classifier module 210 and the likelihood score of the CNN module 212.

For example, using this method with reference to FIG. 6, the robot moving through the office may capture an image of two people 606, a table 604, and a chair 608. The image data may include color data and depth data, and the information may be segmented and provided to the detection module 135 for classification. By using the color data, the blobs may be grouped by the image processor 202 using different pixel colors to group different features. The dark images of the objects in the image 602 may be recognized by the image processor 202. The layer segmentation module 206 may segment the image and send the different blob segments to the classification module 208. Using both the mask image created by the layer segmentation module 206 and the depth data and color data, the image 602 can be more precisely classified using a combined score from the fusion module 214.

FIG. 14A is a flow chart 1400 of a further example method of recognizing image objects. At 1402, the image processor 202 receives image data. The image data may include various images and information captured by the sensors 155. For example, in one embodiment, a robot may be moving through an office space and a camera sensor 155 may capture a depth image and a color image as the robot moves through the space. At 1404, the image processor 202 may extract a depth image and a color image from the image data. For example, the sensor 155 may capture depth information and/or color information of the environment and the image processor 202 may extract the relevant image data captured by the sensor 155.

At 1406, the layer segmentation module 206 creates mask images of the image data by segmenting the image data into a plurality of components as described above with reference to FIGS. 5, 7, and 15. At 1408, the layered classifier module 210 may determine a likelihood score from the depth image data received from the image processor 202 and the mask image data received from the layer segmentation module 206. The likelihood score calculated by the layered classifier module 210 may be based on both the depth image and the mask image. At 1410, the CNN module 212 may determine a likelihood score from the color image received from the image processor 202 and the mask image data received from the layer segmentation module 206. The likelihood score calculated by the CNN module 212 may be based on both the color image and the mask image. In alternative embodiments, the CNN module 212 may receive a depth image and a mask image and calculate a second likelihood score using those images. At 1412, the fusion module 214 may determine a class (i.e., person class or imposter class) of at least a portion of the image data based on the likelihood score from the layered classifier module 210 and the likelihood score of the CNN module 212.
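
Pulling these steps together, the sketch below outlines the fused decision for one region: the layered classifier scores the masked depth data, the CNN scores the masked color data, and the two probability-like scores are combined into a person/imposter decision. The classifier callables, weights, and threshold are placeholders standing in for the modules described above, not their actual implementations.

```python
import math

def classify_region(depth_image, color_image, mask_image,
                    layered_classifier, cnn_classifier,
                    w_layered=0.5, w_cnn=0.5, threshold=math.log(0.5)):
    """Fused pipeline sketch: first likelihood score from the layered
    classifier (depth + mask), second from the CNN (color + mask), combined
    by weighted log-likelihood summation into a class decision."""
    p_layered = layered_classifier(depth_image, mask_image)  # first likelihood score
    p_cnn = cnn_classifier(color_image, mask_image)          # second likelihood score
    overall = (w_layered * math.log(max(p_layered, 1e-12))
               + w_cnn * math.log(max(p_cnn, 1e-12)))
    return ("person" if overall >= threshold else "imposter"), overall
```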

FIG. 14B is a flowchart 1410 of a further example of determining the second likelihood score. At 1416, the layered classifier module 210 receives image data including a mask image and pre-filters the mask image using the layered classifier module 210. In some embodiments, the likelihood scores (i.e., classifications) from the layered classifier module 210 may be used to create the pre-filtered mask image. In some embodiments, the mask image may be generated by the layer segmentation module 206 prior to the layered classifier module 210 receiving the mask image at step 1402. At 1418, the CNN module 212 may determine a likelihood score from the color image and the pre-filtered mask image generated by the layered classifier module 210 at step 1416. In this embodiment, the layered classifier module 210 may pre-filter the mask image before sending the mask image to the CNN module 212. Pre-filtering the mask image decreases the processing time of the deep convolutional neural network, as shown with reference to FIG. 19. At block 1414, the CNN module 212 receives a color image and a mask image (that has not been pre-filtered) and determines a second likelihood score in parallel with the layered classifier module 210 determining a first likelihood score.
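
One plausible form of this pre-filtering step is sketched below: per-component scores from the layered classifier are used to drop components it confidently rejects, so the deep CNN only evaluates the remaining candidates. The score dictionary, keep threshold, and function name are illustrative assumptions.

```python
import numpy as np

def prefilter_mask(mask_image, layered_scores, keep_threshold=0.2):
    """Pre-filter a component mask using the layered classifier's per-component
    person likelihoods: components scored below the threshold are removed so
    the CNN has fewer regions to classify."""
    filtered = np.zeros_like(mask_image)
    for label, score in layered_scores.items():  # {component label: person likelihood}
        if score >= keep_threshold:              # keep ambiguous or likely-person blobs
            filtered[mask_image == label] = label
    return filtered
```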

FIG. 16 is a block diagram of an example image classification system 1600. At 1602, a camera node receives data and extracts a depth image and a color image. The data may be captured by the sensor 155, which may be part of the camera node 1602, or may be captured previously and sent to the camera node 1602. The image processor 202 of the detection module 135 may be included in the camera node 1602.

The segmentation node 1604 may include the layer segmentation module 206 and may use the layer segmentation module 206 to create a mask image out of the depth image received from the camera node 1602. The layered classifier 1606 may include the layered classifier module 210 and may use the depth image and the mask image to calculate a class likelihood that an object in the image data is a person.

In some embodiments, the layered classifier 1606 may also pre-filter the mask image from the segmentation node 1604 to decrease the processing time of the deep convolutional neural network in the CNN classifier 1608. The CNN classifier 1608 may include the CNN module 212. In some embodiments, the CNN classifier 1608 receives the mask image from the segmentation node 1604; in alternative embodiments, the CNN classifier 1608 receives the pre-filtered mask image from the layered classifier 1606 to decrease processing time. The CNN classifier 1608 uses the color image and either the mask image or the pre-filtered mask image to calculate a class likelihood that an object in the image data is a person.

In further embodiments, the CNN classifier 1608 may receive a depth image or other suitable image type instead of/in addition to a color image and perform a deep convolutional network algorithm on that image data instead of/in addition to the color image, along with the mask image. The fusion node 1610 may include the fusion module 214 and may receive the class likelihood scores from the layered classifier 1606 and the CNN classifier 1608. The fusion node 1610 may combine the likelihood scores to create an overall likelihood score. In some embodiments, the fusion node 1610 may further receive the mask image from the segmentation node 1604 for further processing.

FIG. 17 displays data in the form of areas under curves for four different evaluation sets of three different sensor types in three different environments. In FIG. 17A, the graph 1702 displays the performance of seven different algorithms for image recognition, including the overall score generated by the CNN color + layered algorithm that is created using the method described in FIG. 13. As can be seen, using a structured light sensor in an open lab space, the Layered+CNN-RGB algorithm has the highest combined score for object recognition compared to the six other common recognition algorithms. In the graph 1704, the seven algorithms were tested using a structured light sensor in a home environment. In this test, the Layered+CNN-RGB algorithm again has the highest score. The layered score combined with a convolutional neural network score based on the depth image also performed generally better relative to the other methods in these tests.

In FIG. 17B, table 1706 displays data from the test of the seven algorithms using the four different sensors. In the four different tests, the fusion scores for object recognition of the algorithms that used both a layered classifier and a CNN classifier performed better than the other algorithms that did not include a combined overall score. In the structured light sensor tests and the stereo sensor test, the layered classifier combined with the CNN classifier of the color image (RGB) had the highest object recognition scores. In the time-of-flight camera sensor test, the layered classifier combined with the CNN classifier of the depth image had the highest object recognition score.

In FIG. 17C, the graph 1706 displays test data of the seven different algorithms using a stereo camera sensor in an office environment. The Layered+CNN RGB algorithm has the best performance in this test environment as well. In graph 1708, the seven different algorithms were tested using a time-of-flight camera in a home environment. In this test, the fusion of the layered classifier and the CNN classifier outperformed the other algorithms.

FIG. 18 is a block diagram 1800 of an image classification device using pre-filter layered classifiers. The camera node 1802 may include the sensor 155 for capturing data. The camera node 1802 may also include the image processor 202 for processing image data either captured or received. The image processor 202 may process the image data to extract a depth image and/or a color image. The depth image may be sent to the segmentation/tracking+layered classifier node 1804.

The segmentation/tracking+layered classifier node 1804 may include the layer segmentation module 206 and/or the layered classifier module 210. The segmentation/tracking+layered classifier node 1804 segments the depth image and classifies the objects into a layered class result that may be sent to the fusion node 1808. The segmentation/tracking+layered classifier node 1804 also generates an image mask that is then pre-filtered and sent to the CNN classifier node 1806 for classification using the deep convolutional neural network.

The segmentation/tracking+layered classifier node 1804 also may track blob positions in the different sections and provide the blob position information to the fusion node 1808. The CNN classifier node 1806 may receive the pre-filtered mask image and the color image for use in classifying an object in the image data. In some embodiments, the CNN classifier node 1806 may alternatively receive a depth image from the camera node 1802 and use the depth image and the pre-filtered mask image to classify an object in the image data. The fusion node 1808 receives the CNN class results and the layered class results and calculates an overall combined likelihood that an object in the image data is a person. In some embodiments, the fusion node 1808 may also receive the tracked blob positions for further processing or for inclusion in the calculation of the combined likelihood score.

FIG. 19 displays data in the form of graphs of precision vs. recall curves for three different situations. The algorithms used include a pre-filtered CNN (AlexNet) + layered classifier, a CNN + layered classifier without the pre-filter, and various CNN and layered classifiers without the fusion of the two combined classifiers. In graph 1902, the test was of a person falling, measuring the precision of the recognition versus the recall in classifying the object as a person. In this test, the pre-filtered algorithm's precision was somewhere in the middle, while its processing speed was significantly greater than that of the system without the pre-filter. In graph 1904, the test was of a person sitting on a bed. The pre-filtered algorithm may in some cases sacrifice precision for improved computational speed compared to the fused system. In graph 1906, the test was done on a robot moving around the office; in this test, the pre-filtered algorithm shows a higher precision and a higher recall on the curve than the other algorithms.
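
For reference, precision-recall curves of the kind plotted in FIG. 19 can be computed from per-detection confidence scores and ground-truth person labels; the sketch below uses scikit-learn's precision_recall_curve with made-up labels and scores purely for illustration.

```python
from sklearn.metrics import precision_recall_curve

# Illustrative precision-recall computation (labels and scores are made up).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]          # 1 = person, 0 = imposter
y_score = [0.9, 0.4, 0.8, 0.35, 0.2, 0.75, 0.55, 0.1, 0.65, 0.3]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r in zip(precision, recall):
    print(f"precision={p:.2f} recall={r:.2f}")
```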

In the above description, for purposes of explanation, numerous specificdetails are set forth in order to provide a thorough understanding ofthe present disclosure. However, it should be understood that thetechnology described herein could be practiced without these specificdetails. Further, various systems, devices, and structures are shown inblock diagram form in order to avoid obscuring the description. Forinstance, various implementations are described as having particularhardware, software, and user interfaces. However, the present disclosureapplies to any type of computing device that can receive data andcommands, and to any peripheral devices providing services.

In some instances, various implementations may be presented herein interms of algorithms and symbolic representations of operations on databits within a computer memory. An algorithm is here, and generally,conceived to be a self-consistent set of operations leading to a desiredresult. The operations are those requiring physical manipulations ofphysical quantities. Usually, though not necessarily, these quantitiestake the form of electrical or magnetic signals capable of being stored,transferred, combined, compared, and otherwise manipulated. It hasproven convenient at times, principally for reasons of common usage, torefer to these signals as bits, values, elements, symbols, characters,terms, numbers, or the like.

It should be borne in mind, however, that all of these and similar termsare to be associated with the appropriate physical quantities and aremerely convenient labels applied to these quantities. Unlessspecifically stated otherwise as apparent from the following discussion,it is appreciated that throughout this disclosure, discussions utilizingterms including “processing,” “computing,” “calculating,” “determining,”“displaying,” or the like, refer to the action and processes of acomputer system, or similar electronic computing device, thatmanipulates and transforms data represented as physical (electronic)quantities within the computer system's registers and memories intoother data similarly represented as physical quantities within thecomputer system memories or registers or other such information storage,transmission or display devices.

Various implementations described herein may relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, including, but not limited to, any type of disk including floppy disks, optical disks, CD ROMs, and magnetic disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, flash memories including USB keys with non-volatile memory, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.

The technology described herein can take the form of a hardwareimplementation, a software implementation, or implementations containingboth hardware and software elements. For instance, the technology may beimplemented in software, which includes but is not limited to firmware,resident software, microcode, etc. Furthermore, the technology can takethe form of a computer program product accessible from a computer-usableor computer-readable medium providing program code for use by or inconnection with a computer or any instruction execution system. For thepurposes of this description, a computer-usable or computer readablemedium can be any non-transitory storage apparatus that can contain,store, communicate, propagate, or transport the program for use by or inconnection with the instruction execution system, apparatus, or device.

A data processing system suitable for storing and/or executing programcode may include at least one processor coupled directly or indirectlyto memory elements through a system bus. The memory elements can includelocal memory employed during actual execution of the program code, bulkstorage, and cache memories that provide temporary storage of at leastsome program code in order to reduce the number of times code must beretrieved from bulk storage during execution. Input/output or I/Odevices (including but not limited to keyboards, displays, pointingdevices, etc.) can be coupled to the system either directly or throughintervening I/O controllers.

Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems, storage devices, remote printers, etc., through intervening private and/or public networks. Wireless (e.g., Wi-Fi™) transceivers, Ethernet adapters, and modems are just a few examples of network adapters. The private and public networks may have any number of configurations and/or topologies. Data may be transmitted between these devices via the networks using a variety of different communication protocols including, for example, various Internet layer, transport layer, or application layer protocols. For example, data may be transmitted via the networks using transmission control protocol/Internet protocol (TCP/IP), user datagram protocol (UDP), transmission control protocol (TCP), hypertext transfer protocol (HTTP), secure hypertext transfer protocol (HTTPS), dynamic adaptive streaming over HTTP (DASH), real-time streaming protocol (RTSP), real-time transport protocol (RTP) and the real-time transport control protocol (RTCP), voice over Internet protocol (VOIP), file transfer protocol (FTP), WebSocket (WS), wireless access protocol (WAP), various messaging protocols (SMS, MMS, XMS, IMAP, SMTP, POP, WebDAV, etc.), or other known protocols.

Finally, the structure, algorithms, and/or interfaces presented hereinare not inherently related to any particular computer or otherapparatus. Various general-purpose systems may be used with programs inaccordance with the teachings herein, or it may prove convenient toconstruct more specialized apparatus to perform the required methodblocks. The required structure for a variety of these systems willappear from the description above. In addition, the specification is notdescribed with reference to any particular programming language. It willbe appreciated that a variety of programming languages may be used toimplement the teachings of the specification as described herein.

The foregoing description has been presented for the purposes ofillustration and description. It is not intended to be exhaustive or tolimit the specification to the precise form disclosed. Manymodifications and variations are possible in light of the aboveteaching. It is intended that the scope of the disclosure be limited notby this detailed description, but rather by the claims of thisapplication. As will be understood by those familiar with the art, thespecification may be embodied in other specific forms without departingfrom the spirit or essential characteristics thereof. Likewise, theparticular naming and division of the modules, routines, features,attributes, methodologies and other aspects are not mandatory orsignificant, and the mechanisms that implement the specification or itsfeatures may have different names, divisions and/or formats.

Furthermore, the modules, routines, features, attributes, methodologiesand other aspects of the disclosure can be implemented as software,hardware, firmware, or any combination of the foregoing. Also, wherevera component, an example of which is a module, of the specification isimplemented as software, the component can be implemented as astandalone program, as part of a larger program, as a plurality ofseparate programs, as a statically or dynamically linked library, as akernel loadable module, as a device driver, and/or in every and anyother way known now or in the future. Additionally, the disclosure is inno way limited to implementation in any specific programming language,or for any specific operating system or environment.

What is claimed is:
 1. A computer-implemented method for performing object recognition comprising: receiving image data; extracting a depth image and a color image from the image data; creating a mask image by segmenting the image data into a plurality of components; identifying objects within the plurality of components of the mask image; determining a first likelihood score from the depth image and the mask image using a layered classifier; determining a second likelihood score from the color image and the mask image by generating an object image by copying pixels from the first image of the components in the mask image and classifying the object image using a deep convolutional neural network (CNN); and determining a class for at least a portion of the image data based on the first likelihood score and the second likelihood score.
 2. A computer-implemented method for performing objectrecognition comprising: receiving image data; creating a mask image bysegmenting the image data into a plurality of components; determining afirst likelihood score from the image data and the mask image using alayered classifier; determining a second likelihood score from the imagedata and the mask image using a deep convolutional neural network (CNN);and determining a class for at least a portion of the image data basedon the first likelihood score and the second likelihood score.
 3. Thecomputer-implemented method of claim 2, wherein the determining thesecond likelihood score from the image data and the mask image using thedeep convolutional neural network includes: extracting a first imagefrom the image data; generating an object image by copying pixels fromthe first image of the components in the mask image; classifying theobject image using the deep CNN; generating classification likelihoodscores indicating probabilities of the object image belonging todifferent classes of the deep CNN; and generating the second likelihoodscore based on the classification likelihood scores.
 4. Thecomputer-implemented method of claim 3, wherein the first image is oneof a color image, a depth image, and a combination of a color image anda depth image.
 5. The computer-implemented method of claim 2, whereindetermining the class of at least the portion of the image dataincludes: fusing the first likelihood score and the second likelihoodscore into an overall likelihood score; and responsive to satisfying apredetermined threshold with the overall likelihood score, classifyingthe at least the portion of the image data as representing a personusing the overall likelihood score.
 6. The computer-implemented methodof claim 2, further comprising: extracting a depth image and a colorimage from the image data, wherein determining the first likelihoodscore from the image data and the mask image using the layeredclassifier includes determining the first likelihood score from thedepth image and the mask image using the layered classifier, anddetermining the second likelihood score from the image data and the maskimage using the deep CNN includes determining the second likelihoodscore from the color image and the mask image using the deep CNN.
 7. Thecomputer-implemented method of claim 2 wherein the deep CNN has a softmax layer as a final layer to generate the second likelihood that the atleast the portion of the image data represents a person.
 8. Thecomputer-implemented method of claim 2, further comprising: convertingthe first likelihood score and the second likelihood score into a firstlog likelihood value and a second log likelihood value; and calculatinga combined likelihood score by using a weighted summation of the firstlog likelihood value and the second log likelihood value.
 9. The computer-implemented method of claim 2, wherein the class is a person.
 10. The computer-implemented method of claim 2, wherein determining the second likelihood score further comprises: determining the second likelihood score using the image data and the first likelihood score from the layered classifier.
 11. A system for performing object recognition comprising: a processor; and a memory storing instructions that, when executed, cause the system to: create a mask image by segmenting the image data into a plurality of components; determine a first likelihood score from the image data and the mask image using a layered classifier; determine a second likelihood score from the image data and the mask image using a deep convolutional neural network (CNN); and determine a class for at least a portion of the image data based on the first likelihood score and the second likelihood score.
 12. Thesystem of claim 11, wherein the instructions that cause the system todetermine the second likelihood score from the image data and the maskimage using the deep convolutional neural network further include:extract a first image from the image data; generate an object image bycopying pixels from the first image of the components in the mask image;classify the object image using the deep CNN; generate classificationlikelihood scores indicating probabilities of the object image belongingto different classes of the deep CNN; and generate the second likelihoodscore based on the classification likelihood scores.
 13. The system ofclaim 12, wherein the first image is one of a color image, a depthimage, and a combination of a color image and a depth image.
 14. The system of claim 11, wherein the instructions that cause the system to determine the class of at least the portion of the image data include: fuse the first likelihood score and the second likelihood score into an overall likelihood score; and responsive to satisfying a predetermined threshold with the overall likelihood score, classify the at least the portion of the image data as representing a person using the overall likelihood score.
 15. The system of claim 11, wherein the memory stores further instructions that cause the system to: extract a depth image and a color image from the image data, wherein determining the first likelihood score from the image data and the mask image using the layered classifier includes determining the first likelihood score from the depth image and the mask image using the layered classifier, and determining the second likelihood score from the image data and the mask image using the deep CNN includes determining the second likelihood score from the color image and the mask image using the deep CNN.
 16. The system of claim 11, wherein the deep CNN has a soft max layer as a final layer to generate the second likelihood that the at least the portion of the image data represents a person.
 17. The system of claim 11, wherein the memory stores further instructions that cause the system to: convert the first likelihood score and the second likelihood score into a first log likelihood value and a second log likelihood value; and calculate a combined likelihood score by using a weighted summation of the first log likelihood value and the second log likelihood value.
 18. The system of claim 11, wherein the class is a person.
 19. The system ofclaim 11, wherein the instructions that cause the system to determinethe second likelihood score further comprises: pre-filter the mask imageusing the layered classifier; and determine the second likelihood scoreusing the image data and the pre-filtered mask image.
 20. The system ofclaim 11, wherein the layered classifier determines the first likelihoodscore using a Gaussian mixture.